This notebook documents the steps taken to solve the Predict Students’ Ability to Repay Educational Loans challenge posted on the Data Science Community in Coursera.
The data is available at:
https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv.
Documentation for the data is available at https://collegescorecard.ed.gov/data/documentation/. There is a data dictionary at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx.
The workflow suggested in https://www.kaggle.com/startupsci/titanic-data-science-solutions is going to be followed. The workflow is the following:
Question or problem definition.
Acquire training and testing data.
Wrangle, prepare, cleanse the data.
Analyze, identify patterns, and explore the data.
Model, predict and solve the problem.
Visualize, report, and present the problem solving steps and final solution.
Supply the results.
The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions:
We may combine multiple workflow stages. We may analyze by visualizing data.
Perform a stage earlier than indicated. We may analyze data before and after wrangling.
Perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
Drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.
Test whether a set of institutional features can be used to predict student outcomes, in particular debt repayment. This solution explores to what extent institutional characteristics, as well as certain demographic factors, can indicate or predict debt repayment.
The (US) “College Scorecard” (the data set) includes national data on the earnings of former college graduates and new data on student debt.
First import the libraries that are going to be used:
In [2]:
# data analysis and manipulation
import numpy as np
import pandas as pd
np.set_printoptions(threshold=1000)
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
#machine learning
import tensorflow as tf
#Regular expression
import re
In [3]:
all_data = pd.read_csv('datasets/CollegeScorecardData.csv')
In [4]:
all_data.head()
Out[4]:
In [5]:
all_data.info()
There are 7703 examples and 1743 features.
There are 443 float features that may be numeric, 13 integer features that may be categorical, and 1287 features stored as strings, some of which may actually be numbers whose data was not entered correctly (for example, if there was no data for a given feature, someone could have written "blank"). Given the high number of non-numerical features, we need to explore them more. Luckily, there is a data dictionary provided with the data, so we can explore it a little bit to learn about the data. (The original file was converted to CSV.)
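As an illustration of how one could probe such a column (a sketch only; 'SOME_COLUMN' is a placeholder, not an actual column name from the data), pd.to_numeric with errors='coerce' turns entries that cannot be parsed into NaN, so comparing non-null counts before and after shows how many values are not really numbers:
# Sketch only: 'SOME_COLUMN' is a placeholder for one of the object-typed columns
col = all_data['SOME_COLUMN']
parsed = pd.to_numeric(col, errors='coerce')
print("non-null values:", col.notna().sum(), "| parseable as numbers:", parsed.notna().sum())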
In [6]:
data_dict = pd.read_csv('datasets/CollegeScorecardDataDictionary.csv')
In [7]:
data_dict.head()
Out[7]:
In [8]:
data_dict.tail()
Out[8]:
In [9]:
data_dict.info()
There are 1975 entries, but the column NAME OF DATA ELEMENT has only 1734 non-null elements, so something is up. Let's explore the dictionary a little bit more.
In [10]:
data_dict[5:10]
Out[10]:
Nothing suspicious here, let's try again.
In [11]:
data_dict[10:20]
Out[11]:
Aha! It seems that the feature at index 15 is categorical, and that's why the rows that follow it don't have a value under NAME OF DATA ELEMENT. Just for now, let's get rid of those NaN rows.
In [12]:
data_dict_no_nan_names = data_dict.dropna(subset=['NAME OF DATA ELEMENT'])
data_dict_no_nan_names[10:20]
Out[12]:
Let's get the info of the new dictionary.
In [13]:
data_dict_no_nan_names.info()
We are interested primarily in the NAME OF DATA ELEMENT, VARIABLE NAME, and API data type columns. They seem complete. Let's see how many data types there are.
In [14]:
data_dict_no_nan_names['API data type'].unique()
Out[14]:
Let's find out how many features have each data type
In [15]:
data_dict_no_nan_names['API data type'].value_counts()
Out[15]:
So in reality there are 1206 float features, 521 integer features, and 7 string features. (For now we assume that the autocomplete type is string.) These numbers differ a lot from our previous analysis, in which we had 443 float features, 13 integer features, and 1287 string features.
Also, we cannot assume that all features of type integer are categorical; for example, the ZIP code feature is an integer but is not categorical.
Let's find more about the autocomplete features:
In [16]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']
Out[16]:
We can see that these autocomplete features can be treated as strings.
In [17]:
# Drop the columns in which every value is NaN
all_data_no_na_columns = all_data.dropna(axis=1, how='all')
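To see how many columns this removed, comparing the shapes before and after is enough (a quick check that is not part of the original notebook):
# Number of columns that were entirely NaN and therefore dropped
print(all_data.shape[1] - all_data_no_na_columns.shape[1])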
There are features that are meaningless for the problem we are trying to solve. We need to drop these features, but we need a criterion to eliminate them. The criterion we are going to employ is to eliminate features that are unique for every entry and don't add information to the problem; for example, a unique ID for every institution doesn't add information to the problem.
Also, we need to take into account that there are features that may be unique for every entry but DO add relevant information. For example, the tuition fees may be unique and still add information.
Let's find the ratio of the number of unique values over number of examples:
In [18]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []
# Calculate the ratio for each non-float feature
for feature in all_data_no_na_columns.columns.values:
    # Get the row in the dict which has VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    # Check if the feature exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # Float features (numeric features) are not taken into account
        if data_type == 'integer' or data_type == 'string' or data_type == 'autocomplete':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.8:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
    # The feature is not in the dict
    else:
        features_not_in_dict.append(feature)
print("\nFeatures in data but not in the dictionary: " + str(features_not_in_dict))
So there are some features in the data that are not explained in the dictionary. That is not necessarily an inconvenience, so we won't worry about this right now.
Let's find out what those NPT4 features are about.
In [19]:
npt4_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT4_PUB'
npt41_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT41_PUB'
npt42_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT42_PUB'
data_dict_no_nan_names[npt4_pub | npt41_pub | npt42_pub ]
Out[19]:
So those NPT4 features are about average net prices, so they are definitely numeric features, and it makes sense to keep them.
Let's run our previous analysis again without those features so we can have a cleaner view as we lower the threshold.
In [20]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []
# Calculate the ratio, excluding the NPT4 features
for feature in all_data_no_na_columns.columns.values:
    # Get the row in the dict which has VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    # Check if the feature exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # Float features (numeric features) are not taken into account
        if (data_type == 'integer' or data_type == 'string' or data_type == 'autocomplete') \
                and feature[:4] != 'NPT4':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.5:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
print(features_with_high_ratio)
Let's see what these features are about:
In [21]:
# Collect the dictionary rows for the high-ratio features
# (pd.concat is used here instead of the deprecated DataFrame.append)
high_ratio_features = pd.concat(
    [data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
     for feature in features_with_high_ratio]
)
high_ratio_features
Out[21]:
So UNITID, OPEID, OPEID6, INSTNM, INSTURL, NPCURL, and ALIAS are features that have to do with the identity of the institution, so they don't add relevant information to the problem; therefore they will be eliminated.
The ZIP code could be useful if it were used to group the schools into some sort of category based on their location. We are not going to do this, so we are going to eliminate it as well.
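For reference, a minimal sketch of the kind of grouping that could have been done instead (an assumption, not applied in this notebook): coarsening ZIP codes to their first digit, which roughly corresponds to a US geographic region.
# Illustrative only, not run here: coarsen ZIP codes to their first digit
zip_region = all_data_no_na_columns['ZIP'].astype(str).str[:1]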
In [22]:
all_data_no_id_cols = all_data_no_na_columns.drop(['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'INSTURL', 'NPCURL', 'ALIAS', 'ZIP'], axis = 1)
In [23]:
all_data_no_id_cols.head()
Out[23]:
In [24]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'string']
Out[24]:
We already dropped INSTURL and NPCURL. Let's explore the STABBR feature
In [25]:
all_data_no_id_cols['STABBR']
Out[25]:
So this feature has to do with the state where the school is located. Let's explore the ACCREDAGENCY feature:
In [26]:
all_data_no_id_cols['ACCREDAGENCY']
Out[26]:
In [27]:
all_data_no_id_cols['ACCREDAGENCY'].value_counts()
Out[27]:
Now let's explore the autocomplete data type:
In [28]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']
Out[28]:
INSTNM and ALIAS were already dropped; let's see the CITY feature:
In [29]:
all_data_no_id_cols['CITY']
Out[29]:
So STABBR, ACCREDAGENCY, and CITY are features that we are going to keep, but they need to be transformed into a numeric (integer-coded) representation, since ML algorithms work with numbers and not strings.
In [30]:
all_data_no_strings = all_data_no_id_cols.copy()
# STABBR mapping
values = all_data_no_strings['STABBR'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['STABBR'] = all_data_no_strings['STABBR'].map(mapping)
# ACCREDAGENCY mapping
values = all_data_no_id_cols['ACCREDAGENCY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['ACCREDAGENCY'] = all_data_no_strings['ACCREDAGENCY'].map(mapping)
# CITY mapping
values = all_data_no_id_cols['CITY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['CITY'] = all_data_no_strings['CITY'].map(mapping)
all_data_no_strings.head()
Out[30]:
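As a side note, pandas can build the same kind of integer codes more concisely with pd.factorize (a sketch, not what is used above; the codes start at 0 instead of 1 and NaN is encoded as -1):
# Equivalent idea using pd.factorize (codes start at 0; NaN becomes -1)
codes, uniques = pd.factorize(all_data_no_id_cols['STABBR'])
all_data_no_strings_alt = all_data_no_id_cols.copy()
all_data_no_strings_alt['STABBR'] = codes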
Let's see how our data looks so far
In [31]:
all_data_no_strings.info()
Although we mapped or eliminated the string features, we still have a lot of object (non-numeric) columns. Let's work on them.
In [32]:
# Collect the values in object columns that don't look like plain numbers
regex = re.compile(r'[0-9]+(\.[0-9]+)?$')
words = []
for column in all_data_no_strings:
    if all_data_no_strings[column].dtypes == 'object':
        for data in all_data_no_strings[column]:
            if not regex.match(str(data)):
                words.append(data)
In [33]:
pd.Series(words).value_counts()
Out[33]:
We can see that there is a lot of data suppressed for privacy reasons. Also, there are dates, and one of them, 12/31/2999, seems to be invalid. Let's go ahead and replace these values with NaN, so they are treated like any other NaN value. Also, if any column ends up having all of its values as NaN, we will delete that column.
In [34]:
all_data_replaced_with_nan = all_data_no_strings.replace(to_replace = 'PrivacySuppressed', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.replace(to_replace = '12/31/2999', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.dropna(axis=1, how='all')
In [35]:
all_data_replaced_with_nan.info()
Let's find which features are date features.
In [36]:
features_with_date = []
for column in all_data_replaced_with_nan:
    if all_data_replaced_with_nan[column].dtypes == 'object':
        if all_data_replaced_with_nan[column].str.match('[0-9]{2}/[0-9]{2}/[0-9]{4}').any():
            features_with_date.append(column)
In [37]:
features_with_date
Out[37]:
In [38]:
data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == 'SEPAR_DT_MDN']
Out[38]:
It seems that SEPAR_DT_MDN doesn't add valuable information to the problem, so we are going to drop it.
In [39]:
all_data_no_dates = all_data_replaced_with_nan.drop(['SEPAR_DT_MDN'], axis = 1)
Now we will transform all the object features to numeric.
In [40]:
all_data_no_objects = all_data_no_dates.copy()
for feature in all_data_no_dates:
    if all_data_no_dates[feature].dtypes == 'object':
        # Make all data numeric
        all_data_no_objects[feature] = pd.to_numeric(all_data_no_dates[feature])
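If any stray non-numeric strings had survived the earlier cleanup, pd.to_numeric would raise a ValueError on them. A more defensive variant (an assumption about how one might harden this step, not what the notebook runs) is to pass errors='coerce' inside the loop so unparseable values become NaN:
        # Hypothetical defensive variant of the line above: unparseable values become NaN
        all_data_no_objects[feature] = pd.to_numeric(all_data_no_dates[feature], errors='coerce')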
In [41]:
all_data_no_objects.info()
Now we have gotten rid of the object dtype. Next, let's find the features in which more than 90% of the values are missing; they carry too little information to be imputed reliably, so we will drop them.
In [42]:
# Find the features in which more than 90% of the values are missing
high_nan_features = []
for feature in all_data_no_objects:
    size = all_data_no_objects[feature].size
    number_of_valid = all_data_no_objects[feature].count()
    number_of_nan = size - number_of_valid
    ratio = number_of_nan / size
    if ratio > 0.9:
        high_nan_features.append(feature)
print(len(high_nan_features))
In [43]:
all_data_no_high_nan = all_data_no_objects.drop(high_nan_features, axis = 1)
In [44]:
all_data_no_high_nan.info()
In [45]:
data_dict[15:25]
Out[45]:
We can see that after the name of a categorical feature there is at least one row with NaN under NAME OF DATA ELEMENT (the rows listing its possible values). Let's use this to get a list of categorical features.
In [46]:
categorical_features = []
is_null = data_dict['NAME OF DATA ELEMENT'].isnull()
for i in range(len(is_null) - 1):
    if not is_null[i] and is_null[i + 1]:
        categorical_features.append(data_dict['VARIABLE NAME'][i])
To fill the missing data that belongs to a categorical feature, we will use the most common value of the data (mode). To fill the missing data that belongs to a numeric feature, we will use the average of the data (mean).
In [47]:
all_data_no_nan = all_data_no_high_nan.copy()
for feature in all_data_no_high_nan:
    if feature in categorical_features:
        mode = all_data_no_high_nan[feature].mode()[0]
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mode)
    else:
        mean = all_data_no_high_nan[feature].mean()
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mean)
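As an aside, scikit-learn's SimpleImputer offers an equivalent, pipeline-friendly way to do this kind of imputation. The sketch below assumes the same categorical/numeric split as above and is not what the notebook actually runs:
from sklearn.impute import SimpleImputer

# Hypothetical alternative: mode for the categorical columns, mean for the rest
cat_cols = [c for c in all_data_no_high_nan.columns if c in categorical_features]
num_cols = [c for c in all_data_no_high_nan.columns if c not in categorical_features]
all_data_imputed = all_data_no_high_nan.copy()
all_data_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(all_data_no_high_nan[cat_cols])
all_data_imputed[num_cols] = SimpleImputer(strategy='mean').fit_transform(all_data_no_high_nan[num_cols])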
In [48]:
all_data_no_nan.head()
Out[48]:
In [49]:
all_data_no_nan.info()
Let's save the data in a file
In [50]:
all_data_no_nan.to_csv('datasets/CollegeScorecardDataCleaned.csv', index = False)